Abstract: Recently, it was shown that excellent results can be achieved in both face landmark localization and pose-invariant face recognition. These breakthroughs are attributed to the efforts of the community to manually annotate facial images in many different poses and to collect 3D face data. In this paper, we propose a novel method for joint face landmark localization and frontal face reconstruction (pose correction) using a small set of frontal images only. By observing that the frontal facial image is the one with the minimum rank among all different poses, we formulate an appropriate model which is able to jointly recover the facial landmarks as well as the frontalized version of the face. To this end, a suitable optimization problem, involving the minimization of the nuclear norm and the matrix $\ell_1$ norm, is solved. The proposed method is assessed in frontal face reconstruction (pose correction), face landmark localization, and pose-invariant face recognition and verification by conducting experiments on 6 facial image databases. The experimental results demonstrate the effectiveness of the proposed method.

I. Introduction


Fig. 1. Flowchart of the proposed method: a) Given an input image, the results from a detector, and a statistical model U built on frontal images only, b) a constrained low-rank minimization problem is solved (panel text: minimize $\mathrm{rank}(\mathbf{L}) + \lambda\|\mathbf{E}\|_0$ over $\{\mathbf{L}, \mathbf{c}, \Delta\mathbf{p}, \mathbf{e}\}$ subject to $\mathbf{X}(\mathbf{p}) = \mathbf{L} + \mathbf{E}$), and c) face alignment and frontal view reconstruction are performed simultaneously. Finally, d) face recognition is performed using the frontalized image.


Face analysis is one of the most popular computer vision problems. Important topics in face analysis include generic face alignment [5], [39] and automatic face recognition [4], [40]. These problems have been considered as separate, each generating a wealth of scientific research in computer vision. In particular, state-of-the-art face alignment and landmark localization methods [5], [39] model the problem discriminatively by capitalizing on the availability of data annotated with regard to facial landmarks [30], [31]. Unfortunately, the annotation of facial landmarks is a laborious, expensive, and time-consuming process. This is more evident in cases where the face is not in frontal pose and some facial features and the boundary are neither visible nor well-defined.¹

Existing methods for face alignment can be roughly divided into two main categories: (a) holistic methods, which use the whole texture of the face as representation, and (b) part-based methods, which represent the face by using a set of local image patches extracted around the predefined landmark points. The best-known methods from the first category are the Active Appearance Models (AAMs) [12], [25], [35] and the 3D Deformable Models (3DMs) [8]. The second category includes methods such as the Active Shape Models (ASMs) [13] and the Constrained Local Models (CLMs) [14], [32]. Many of the above-mentioned face alignment methods have achieved state-of-the-art results (e.g., [5], [39]) in facial landmark localization under in-the-wild conditions, but they are trained on many annotated samples from various poses.

¹From experience we know that annotating facial images in non-frontal poses takes, in many cases, twice the time needed for frontal poses.

On the other hand, in the majority of face recognition systems the first and, arguably, most important step of face alignment is taken for granted by using off-the-shelf methods [38], [39]. Even in the recent state-of-the-art face recognition methods, where millions of images are used to train feature extractors and classifiers, the pivotal step that increases their performance is that of face alignment [34], [40]. In such cases, the alignment step is very elaborate, requiring both the localization of landmarks and the use of 3D face models for pose correction. In general, 3D model-based methods have high recognition accuracy due to the incorporation of the 3D model. However, such methods cannot be widely applied since they require: (a) a method for accurate landmark localization in various poses, (b) fitting a learned 3D model of faces, which is expensive to build, and (c) robust image warping algorithms to reconstruct the frontal image [34]. A recent approach that does not require a 3D model but only a small set of landmarks is presented in [18]. This method aims to reconstruct the virtual view of a non-frontal image by employing a Markov Random Field. The main drawback of the aforementioned method is that, for each non-frontal image, an exhaustive batch-based alignment algorithm trained on frontal patches is applied. Clearly, such a procedure is time consuming.

In this paper, motivated by the observation that the rank of a frontal facial image, due to the approximate structure of the human face, is much smaller than the rank of facial images in other poses, we propose a unified framework for joint face frontalization (pose correction), landmark localization, and (single sample) pose-invariant face recognition. We show that this can be achieved by using a model built from frontal images only. To validate the above observation, ‘Neutral’ images of twenty subjects from the MultiPIE database under poses -30° : 30° were warped into a reference frame and the nuclear norm (convex surrogate of the rank) of each shape-free texture was computed. In Fig. 2 the average value of the nuclear norm for the different poses is depicted. Clearly, the frontal pose has the smallest nuclear norm value compared with the corresponding values of the remaining poses. The flowchart of the proposed method (coined as FAR - Face frontalization for Alignment and Recognition) is depicted in Fig. 1.

Fig. 2. The average value of the nuclear norm computed based on neutral images of twenty subjects from the MultiPIE database under poses -30° : 30°. The initial images and the warped ones of a subject are also depicted.
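The nuclear norm used in the validation above is simply the sum of the singular values of the texture matrix. The following is a minimal NumPy sketch of that quantity, not the authors' code; the function name `nuclear_norm` and the toy matrices are illustrative assumptions, and real use would require shape-free textures warped into the reference frame.

```python
import numpy as np


def nuclear_norm(image):
    """Sum of the singular values of a 2D array (convex surrogate of the rank)."""
    singular_values = np.linalg.svd(np.asarray(image, dtype=float), compute_uv=False)
    return float(singular_values.sum())


# Toy illustration (assumed data, not face images): a rank-1 "texture" and the
# same texture with an unstructured, full-rank perturbation added to it.
rng = np.random.default_rng(0)
low_rank = np.outer(rng.random(40), rng.random(30))
perturbed = low_rank + 0.1 * rng.standard_normal((40, 30))
```

In this toy setting the perturbed matrix has many additional non-zero singular values, so its nuclear norm is larger, which mirrors the comparison across poses reported in Fig. 2.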

The most closely related work to the proposed method is the Transform Invariant Low-rank Textures (TILT) method [43]. In TILT, texture rectification is obtained by applying a global affine transformation onto a low-rank term modelling the texture. If low-rank constraints are imposed blindly, without regularization, the opposite effect may occur for non-rigid alignment. We applied TILT with a non-rigid facial shape model and its performance was very poor, as can be observed in Fig. 3. Recently, it was demonstrated [11], [29] that non-rigid deformable models cannot be straightforwardly combined with optimization problems [27] that involve low-rank terms without proper regularization. To overcome this and ensure that unnatural faces will not be created, a model of frontal images is employed. In that sense, our method can be seen as a deformable TILT model regularized within a frontal face subspace.

Fig. 3. Results produced by (a) the non-rigid TILT and (b) the FAR.

Summarizing, the contributions of the paper are:

• Technical contribution: We develop a joint landmark localization and face frontalization method by proposing a deformable and appropriately regularized TILT.

• Applications in computer vision:

1) To the best of our knowledge, this is the first generic landmark localization method which achieves state-of-the-art results using a model of frontal images only.

2) It is possible to improve or match the state of the art in unconstrained face recognition using only frontal faces and simple features for classification, unlike other methods that rely on complex feature extraction procedures, e.g., [33].²

II. Notation and preliminaries 

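Throughout the paper, scalars are denoted by lower-case letters, and vectors (matrices) are denoted by lower-case (upper-case) boldface letters, i.e., $\mathbf{x}$ ($\mathbf{X}$). $\mathbf{I}$ denotes the identity matrix of compatible dimensions. The $i$th column of $\mathbf{X}$ is denoted by $\mathbf{x}_i$. A vector $\mathbf{x} \in \mathbb{R}^{m \cdot n}$ (matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$) is reshaped into a matrix (vector) via the reshape operator: $\mathcal{R}_{m \times n}(\mathbf{x}) = \mathbf{X} \in \mathbb{R}^{m \times n}$ ($\mathrm{vec}(\mathbf{X}) = \mathbf{x} \in \mathbb{R}^{m \cdot n}$).

The $\ell_1$ and the $\ell_2$ norms of $\mathbf{x}$ are defined as $\|\mathbf{x}\|_1 = \sum_i |x_i|$ and $\|\mathbf{x}\|_2 = \sqrt{\sum_i x_i^2}$, respectively. The matrix $\ell_1$ norm is defined as $\|\mathbf{X}\|_1 = \sum_i \sum_j |x_{ij}|$, where $|\cdot|$ denotes the absolute value operator. The Frobenius norm is defined as $\|\mathbf{X}\|_F = \sqrt{\sum_i \sum_j x_{ij}^2}$ and the nuclear norm of $\mathbf{X}$ (i.e., the sum of the singular values of a matrix) is denoted by $\|\mathbf{X}\|_*$. $\mathbf{X}^T$ is the transpose of $\mathbf{X}$. If $\mathbf{X}$ is a square matrix, $\mathbf{X}^{-1}$ is its inverse, provided that the inverse matrix exists.

The warp function $\mathbf{x}(\mathcal{W}(\mathbf{p}))$ ($\mathbf{X}(\mathcal{W}(\mathbf{p}))$) denotes the warping of 2D coordinates arranged as a vector (matrix) by a warp parameter vector $\mathbf{p} \in \mathbb{R}^{v}$, where $v$ is the number of warping parameters, back to the reference coordinate system. To simplify the notation, $\mathbf{x}(\mathbf{p})$ ($\mathbf{X}(\mathbf{p})$) will be used throughout the paper instead of $\mathbf{x}(\mathcal{W}(\mathbf{p}))$ ($\mathbf{X}(\mathcal{W}(\mathbf{p}))$).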

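²We note that we refer to the restricted protocol of the LFW [20] and not to the unrestricted one, in which unfortunately we cannot compete since we do not have access to millions of annotated faces.

III. Proposed method
A. Problem formulation

In this paper, the goal is to recover the frontal view (i.e., $\mathbf{L} \in \mathbb{R}^{m \times n}$) of a warped facial image (i.e., $\mathbf{X}(\mathbf{p}) \in \mathbb{R}^{m \times n}$), which is possibly corrupted by sparse errors of large magnitude. Such sparse errors indicate that only a small fraction of the image pixels may be corrupted by non-Gaussian noise and occlusions. In particular, based on the observation that the frontal view of a face lies in a low-rank subspace (please refer to Fig. 2), it can be expressed as a linear combination of a small number of learned orthonormal bases (i.e., $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_k] \in \mathbb{R}^{mn \times k}$, $\mathbf{U}^T\mathbf{U} = \mathbf{I}$) that span a generic clean frontal face subspace, that is, $\mathbf{L} = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i$. Therefore, a warped corrupted image is written as:

$$\mathbf{X}(\mathbf{p}) = \mathbf{L} + \mathbf{E} = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i + \mathbf{E}, \qquad (1)$$

where $\mathbf{E}$ is the sparse error matrix.

To match the specifications of the frontal image and of the sparse error, one can find the low-rank frontal image, the linear combination coefficients, the increments of the warp parameters, and the error matrix by solving:

$$\operatorname*{argmin}_{\mathbf{L}, \mathbf{E}, \mathbf{c}, \Delta\mathbf{p}} \|\mathbf{L}\|_* + \lambda\|\mathbf{E}\|_1 \quad \text{s.t.} \quad \mathbf{X}(\mathbf{p}) = \mathbf{L} + \mathbf{E}, \quad \mathbf{L} = \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i, \qquad (2)$$

where the nuclear norm $\|\mathbf{L}\|_*$ and the $\ell_1$ norm $\|\mathbf{E}\|_1$ are utilized to promote low rank in $\mathbf{L}$ and sparsity in $\mathbf{E}$, while $\lambda$ is a positive parameter balancing the norms. The nuclear and $\ell_1$ norms are the closest convex surrogates to the natural criteria of rank [16] and cardinality [15], which are NP-hard in general to optimize [26], [37]. However, (2) is difficult to solve due to the non-linearity of the constraint $\mathbf{X}(\mathbf{p}) = \mathbf{L} + \mathbf{E}$. To remedy this, a first-order Taylor approximation is applied to the vectorized form of the constraint: $\mathbf{x}(\mathbf{p} + \Delta\mathbf{p}) \approx \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p}$, where $\mathrm{vec}(\mathbf{X}(\mathbf{p})) = \mathrm{vec}(\mathbf{L} + \mathbf{E}) = \mathbf{U}\mathbf{c} + \mathbf{e} = \mathbf{x}(\mathbf{p})$ and $\mathbf{J}(\mathbf{p}) = \nabla\mathbf{x}(\mathbf{p})$ is the Jacobian matrix with the steepest descent images as its columns. Consequently, with $\mathbf{e} = \mathrm{vec}(\mathbf{E})$, (2) is written as:

$$\operatorname*{argmin}_{\mathbf{L}, \mathbf{e}, \mathbf{c}, \Delta\mathbf{p}} \|\mathbf{L}\|_* + \lambda\|\mathbf{e}\|_1 \quad \text{s.t.} \quad h_1(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e}) = \mathbf{0}, \quad h_2(\mathbf{L}, \mathbf{c}) = \mathbf{0}, \qquad (3)$$

where $h_1(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e}) = \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p} - \mathbf{U}\mathbf{c} - \mathbf{e}$ and $h_2(\mathbf{L}, \mathbf{c}) = \mathbf{L} - \sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_i$.

B. Optimization

To solve (3), the augmented Lagrangian [7] is introduced:

$$\mathcal{L}(\mathbf{L}, \mathbf{c}, \Delta\mathbf{p}, \mathbf{e}, \mathbf{a}, \mathbf{B}) = \|\mathbf{L}\|_* + \lambda\|\mathbf{e}\|_1 + \mathbf{a}^T h_1(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e}) + \mathrm{tr}\big(\mathbf{B}^T h_2(\mathbf{L}, \mathbf{c})\big) + \frac{\mu}{2}\Big(\|h_1(\Delta\mathbf{p}, \mathbf{c}, \mathbf{e})\|_2^2 + \|h_2(\mathbf{L}, \mathbf{c})\|_F^2\Big), \qquad (4)$$

where $\mathbf{a}$ and $\mathbf{B}$ are the Lagrange multipliers for the equality constraints in (3) and $\mu > 0$ is a penalty parameter. By employing the Alternating Directions Method (ADM) of multipliers [7], (3) is solved by minimizing (4) with respect to each variable in an alternating fashion, and finally the Lagrange multipliers are updated at each iteration, as outlined in Algorithm 1. Its derivation is provided next.

Let $t$ be the iteration index. Given $\mathbf{L}_t$, $\mathbf{c}_t$, $\Delta\mathbf{p}_t$, $\mathbf{e}_t$, $\mathbf{a}_t$, and $\mathbf{B}_t$, the updates are computed by solving the following sub-problems:

$$\mathbf{L}_{t+1} = \operatorname*{argmin}_{\mathbf{L}} \mathcal{L}(\mathbf{L}, \mathbf{c}_t, \Delta\mathbf{p}_t, \mathbf{e}_t, \mathbf{a}_t, \mathbf{B}_t) \qquad (5)$$

$$\mathbf{c}_{t+1} = \operatorname*{argmin}_{\mathbf{c}} \mathcal{L}(\mathbf{L}_{t+1}, \mathbf{c}, \Delta\mathbf{p}_t, \mathbf{e}_t, \mathbf{a}_t, \mathbf{B}_t) \qquad (6)$$

$$\Delta\mathbf{p}_{t+1} = \operatorname*{argmin}_{\Delta\mathbf{p}} \mathcal{L}(\mathbf{L}_{t+1}, \mathbf{c}_{t+1}, \Delta\mathbf{p}, \mathbf{e}_t, \mathbf{a}_t, \mathbf{B}_t) \qquad (7)$$

$$\mathbf{e}_{t+1} = \operatorname*{argmin}_{\mathbf{e}} \mathcal{L}(\mathbf{L}_{t+1}, \mathbf{c}_{t+1}, \Delta\mathbf{p}_{t+1}, \mathbf{e}, \mathbf{a}_t, \mathbf{B}_t) \qquad (8)$$

Solving for L: Fixing $\mathbf{c}_t$, $\Delta\mathbf{p}_t$, $\mathbf{e}_t$, $\mathbf{a}_t$, and $\mathbf{B}_t$, sub-problem (5) is reduced to:

$$\operatorname*{argmin}_{\mathbf{L}} \|\mathbf{L}\|_* + \mathrm{tr}\big(\mathbf{B}_t^T h_2(\mathbf{L}, \mathbf{c}_t)\big) + \frac{\mu_t}{2}\|h_2(\mathbf{L}, \mathbf{c}_t)\|_F^2. \qquad (9)$$

The nuclear norm regularized least squares problem (9) has the following closed-form solution:

$$\mathbf{L}_{t+1} = \mathcal{D}_{1/\mu_t}\Big[\sum_{i=1}^{k} \mathcal{R}_{m \times n}(\mathbf{u}_i) c_{i,t} - \mathbf{B}_t/\mu_t\Big]. \qquad (10)$$

The singular value thresholding (SVT) operator is defined for any matrix $\mathbf{Q}$ with singular value decomposition $\mathbf{Q} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ as $\mathcal{D}_\tau[\mathbf{Q}] = \mathbf{U}\mathcal{S}_\tau[\boldsymbol{\Sigma}]\mathbf{V}^T$ [9], with $\mathcal{S}_\tau[\sigma] = \mathrm{sgn}(\sigma)\max(|\sigma| - \tau, 0)$ being the (element-wise) shrinkage operator [10].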
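The two operators just defined can be sketched in a few lines of NumPy. This is an illustrative implementation under the definitions of $\mathcal{S}_\tau$ and $\mathcal{D}_\tau$ given above, not the authors' code; the function names `shrink` and `svt` are assumptions introduced here.

```python
import numpy as np


def shrink(values, tau):
    """Element-wise shrinkage S_tau[v] = sgn(v) * max(|v| - tau, 0)."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.maximum(np.abs(values) - tau, 0.0)


def svt(matrix, tau):
    """Singular value thresholding D_tau[Q] = U S_tau[Sigma] V^T."""
    left, sigma, right_t = np.linalg.svd(np.asarray(matrix, dtype=float),
                                         full_matrices=False)
    # Scale each left singular vector by its thresholded singular value.
    return (left * shrink(sigma, tau)) @ right_t
```

Under these definitions, the update (10) would amount to calling `svt` on the current subspace reconstruction minus $\mathbf{B}_t/\mu_t$ with threshold $1/\mu_t$.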


Algorithm 1: Solving (4) by the ADM method

Data: Test image X, initial deformation parameters p, clean face subspace U, and the parameter λ
Result: The low-rank clean image L, the sparse error e, the coefficient vector c, and the increments of the deformation parameters Δp.
while not converged do
    Warp and normalize the image;
    Compute the Jacobian matrix;
    Initialize: L_0 = 0, c_0 = 0, e_0 = 0, a_0 = 0, B_0 = 0, μ_0 > 0, ρ > 1;
    while not converged do
        Update L_{t+1} by (10);
        Update c_{t+1} by (12);
        Update Δp_{t+1} by (15);
        Update e_{t+1} by (17);
        Update the Lagrange multipliers by
            a_{t+1} = a_t + μ_t h_1(Δp_{t+1}, c_{t+1}, e_{t+1});
            B_{t+1} = B_t + μ_t h_2(L_{t+1}, c_{t+1});
        Update μ_{t+1} by μ_{t+1} ← min(ρ · μ_t, 10^^);
        Check convergence conditions (18) and (19);
        t ← t + 1;
    end
    p ← p + Δp;
end


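Solving for c: Fixing the other variables, sub-problem (6) is reduced to:

$$\operatorname*{argmin}_{\mathbf{c}} \mathbf{a}_t^T h_1(\Delta\mathbf{p}_t, \mathbf{c}, \mathbf{e}_t) + \mathrm{tr}\big(\mathbf{B}_t^T h_2(\mathbf{L}_{t+1}, \mathbf{c})\big) + \frac{\mu_t}{2}\Big(\|h_1(\Delta\mathbf{p}_t, \mathbf{c}, \mathbf{e}_t)\|_2^2 + \|h_2(\mathbf{L}_{t+1}, \mathbf{c})\|_F^2\Big). \qquad (11)$$

(11) is a quadratic problem which, for each $i \in \{1, \ldots, k\}$, admits a closed-form solution given by:

$$c_{i,t+1} = \frac{\mathbf{a}_t^T\mathbf{u}_i + \mathrm{tr}\big(\mathbf{B}_t^T \mathcal{R}_{m \times n}(\mathbf{u}_i)\big) + \mu_t\Big(\tilde{\mathbf{x}}^T\mathbf{u}_i + \mathrm{tr}\big(\mathbf{L}_{t+1}^T \mathcal{R}_{m \times n}(\mathbf{u}_i)\big)\Big)}{2\mu_t}, \qquad (12)$$

where $\tilde{\mathbf{x}} = \mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p}_t - \mathbf{e}_t$.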
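The closed form (12) can be checked numerically. The sketch below is an assumption-laden illustration, not the authors' implementation: it vectorises (12) over all $i$ by using $\mathrm{tr}(\mathbf{B}^T \mathcal{R}_{m \times n}(\mathbf{u}_i)) = \mathrm{vec}(\mathbf{B})^T\mathbf{u}_i$, which holds as long as `vec` and the reshape operator use the same element ordering (NumPy's row-major order is used here for both). The names `update_c` and `objective_c` are hypothetical.

```python
import numpy as np


def objective_c(c, U, a, B, L, x_tilde, mu):
    """Objective (11) as a function of c, written in vectorised form."""
    h1 = x_tilde - U @ c              # h_1 with x_tilde = x(p) + J(p) dp - e
    h2 = L.reshape(-1) - U @ c        # vec(h_2) = vec(L) - U c
    return a @ h1 + B.reshape(-1) @ h2 + 0.5 * mu * (h1 @ h1 + h2 @ h2)


def update_c(U, a, B, L, x_tilde, mu):
    """Closed-form minimiser (12) for all coefficients at once.

    Relies on U having orthonormal columns (U^T U = I).
    """
    b, l = B.reshape(-1), L.reshape(-1)
    return (U.T @ (a + b) + mu * (U.T @ (x_tilde + l))) / (2.0 * mu)
```

Because $\mathbf{U}^T\mathbf{U} = \mathbf{I}$, the Hessian of (11) with respect to $\mathbf{c}$ is $2\mu_t\mathbf{I}$, so any perturbation of the returned coefficients should strictly increase `objective_c`.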


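Solving for Δp: Sub-problem (7) is written as:

$$\operatorname*{argmin}_{\Delta\mathbf{p}} \mathbf{a}_t^T h_1(\Delta\mathbf{p}, \mathbf{c}_{t+1}, \mathbf{e}_t) + \frac{\mu_t}{2}\|h_1(\Delta\mathbf{p}, \mathbf{c}_{t+1}, \mathbf{e}_t)\|_2^2. \qquad (13)$$

By exploiting the fact that $\mathbf{U}$ is orthonormal, each part of (13) is decomposed into the term projected onto $\mathbf{U}\mathbf{U}^T$ and the term projected onto the orthogonal complement $\mathbf{I} - \mathbf{U}\mathbf{U}^T$. The update of $\Delta\mathbf{p}$ is obtained by minimizing the part projected onto the orthogonal complement, i.e.,

$$\operatorname*{argmin}_{\Delta\mathbf{p}} \mathbf{a}_t^T(\mathbf{I} - \mathbf{U}\mathbf{U}^T)\big(\mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p} - \mathbf{e}_t\big) + \frac{\mu_t}{2}\big\|(\mathbf{I} - \mathbf{U}\mathbf{U}^T)\big(\mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p} - \mathbf{e}_t\big)\big\|_2^2. \qquad (14)$$

The solution of (14) is given by:

$$\Delta\mathbf{p}_{t+1} = -\big(\tilde{\mathbf{J}}(\mathbf{p})^T\tilde{\mathbf{J}}(\mathbf{p})\big)^{-1}\tilde{\mathbf{J}}(\mathbf{p})^T\big(\mathbf{x}(\mathbf{p}) - \mathbf{e}_t\big), \qquad (15)$$

where $\tilde{\mathbf{J}}(\mathbf{p}) = (\mathbf{I} - \mathbf{U}\mathbf{U}^T)\mathbf{J}(\mathbf{p})$ is the Jacobian projected onto $\mathbf{I} - \mathbf{U}\mathbf{U}^T$. To calculate the term $\tilde{\mathbf{J}}^T\tilde{\mathbf{J}}$ efficiently, the following formulation is used: $\tilde{\mathbf{J}}^T\tilde{\mathbf{J}} = \mathbf{J}^T(\mathbf{I} - \mathbf{U}\mathbf{U}^T)\mathbf{J} = \mathbf{J}^T\mathbf{J} - (\mathbf{U}^T\mathbf{J})^T(\mathbf{U}^T\mathbf{J})$.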
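The efficiency remark about the projected Jacobian can be illustrated as follows. This is a sketch under the assumption that the projector is $\mathbf{I} - \mathbf{U}\mathbf{U}^T$ with orthonormal $\mathbf{U}$; it evaluates the update of form (15) without ever forming the large $mn \times mn$ projector. The function name `update_dp` and its argument layout are hypothetical.

```python
import numpy as np


def update_dp(U, J, x, e):
    """Warp increment -(Jt^T Jt)^{-1} Jt^T (x - e), with Jt = (I - U U^T) J.

    U: (d, k) orthonormal basis, J: (d, v) Jacobian, x and e: (d,) vectors.
    """
    UtJ = U.T @ J                          # small (k, v) matrix
    gram = J.T @ J - UtJ.T @ UtJ           # equals Jt^T Jt since U^T U = I
    r = x - e
    rhs = J.T @ r - UtJ.T @ (U.T @ r)      # equals Jt^T r
    return -np.linalg.solve(gram, rhs)
```

Only products with the thin matrices $\mathbf{U}$ and $\mathbf{J}$ are needed, so the cost is linear in the number of pixels rather than quadratic.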

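Solving for e: Using $\mathbf{L}_{t+1}$, $\mathbf{c}_{t+1}$, and $\Delta\mathbf{p}_{t+1}$, (8) is written as:

$$\operatorname*{argmin}_{\mathbf{e}} \lambda\|\mathbf{e}\|_1 + \mathbf{a}_t^T h_1(\Delta\mathbf{p}_{t+1}, \mathbf{c}_{t+1}, \mathbf{e}) + \frac{\mu_t}{2}\|h_1(\Delta\mathbf{p}_{t+1}, \mathbf{c}_{t+1}, \mathbf{e})\|_2^2. \qquad (16)$$

The closed-form solution of (16) is given by applying the shrinkage operator element-wise onto $\mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p}_{t+1} - \mathbf{U}\mathbf{c}_{t+1} + \mathbf{a}_t/\mu_t$, namely

$$\mathbf{e}_{t+1} = \mathcal{S}_{\lambda/\mu_t}\big[\mathbf{x}(\mathbf{p}) + \mathbf{J}(\mathbf{p})\Delta\mathbf{p}_{t+1} - \mathbf{U}\mathbf{c}_{t+1} + \mathbf{a}_t/\mu_t\big]. \qquad (17)$$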


Convergence criteria: The inner loop of Algorithm 1 terminates when

$$\max\big(\|\mathbf{e}_t - \mathbf{e}_{t-1}\|_2 / \|\mathbf{x}(\mathbf{p})\|_2,\ \|\mathbf{L}_t - \mathbf{L}_{t-1}\|_F / \|\mathbf{x}(\mathbf{p})\|_2\big) < \epsilon_2, \qquad (18)$$

and

$$\max\big(\|h_1(\Delta\mathbf{p}_{t+1}, \mathbf{c}_{t+1}, \mathbf{e}_{t+1})\|_2 / \|\mathbf{x}(\mathbf{p})\|_2,\ \|h_2(\mathbf{L}_{t+1}, \mathbf{c}_{t+1})\|_F / \|\mathbf{x}(\mathbf{p})\|_2\big) < \epsilon_3. \qquad (19)$$

Algorithm 1 terminates when the change of $\|\mathbf{L}\|_* + \lambda\|\mathbf{E}\|_1$ between two successive iterations is smaller than the threshold $\epsilon_1$ or the maximum number of outer-loop iterations is reached. The dominant cost of each iteration of Algorithm 1 is that of the SVD involved in the computation of the SVT operator in the update of $\mathbf{L}$. Consequently, the computational complexity of Algorithm 1 is O(T(min(m,n)^ + n^k)), where T is the total number of iterations until convergence.
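To make the flow of the inner loop concrete, the following NumPy sketch chains updates of the form (10), (12), (15), and (17), the multiplier updates, and the stopping rules (18) and (19) for a fixed warp. It is a simplified illustration rather than the authors' implementation: the function name `inner_loop`, the initial penalty `mu=1e-3`, the cap `mu_max=1e10`, the tolerance `eps3=1e-7`, and the iteration limit are all assumptions chosen for the example, and image warping and Jacobian computation (the outer loop) are left out.

```python
import numpy as np


def shrink(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def svt(Q, tau):
    u, s, vt = np.linalg.svd(Q, full_matrices=False)
    return (u * shrink(s, tau)) @ vt


def inner_loop(x, J, U, shape, lam=0.3, rho=1.1, mu=1e-3, mu_max=1e10,
               eps2=1e-5, eps3=1e-7, max_iter=1000):
    """Inner ADM loop for a fixed warp (assumed defaults, see lead-in).

    x: vectorised warped image, J: (mn, v) Jacobian,
    U: (mn, k) orthonormal basis, shape: (m, n) of the image.
    """
    mn, k = U.shape
    L, B = np.zeros(shape), np.zeros(shape)
    c, dp = np.zeros(k), np.zeros(J.shape[1])
    e, a = np.zeros(mn), np.zeros(mn)
    UtJ = U.T @ J
    gram = J.T @ J - UtJ.T @ UtJ               # projected J^T J
    x_norm = np.linalg.norm(x)
    for _ in range(max_iter):
        L_old, e_old = L, e
        # L update, cf. (10)
        L = svt((U @ c).reshape(shape) - B / mu, 1.0 / mu)
        # c update, cf. (12)
        x_tilde = x + J @ dp - e
        c = (U.T @ (a + B.reshape(-1))
             + mu * (U.T @ (x_tilde + L.reshape(-1)))) / (2.0 * mu)
        # warp increment update, cf. (15)
        r = x - e
        dp = -np.linalg.solve(gram, J.T @ r - UtJ.T @ (U.T @ r))
        # sparse error update, cf. (17)
        e = shrink(x + J @ dp - U @ c + a / mu, lam / mu)
        # Lagrange multipliers and penalty parameter
        h1 = x + J @ dp - U @ c - e
        h2 = L - (U @ c).reshape(shape)
        a, B = a + mu * h1, B + mu * h2
        mu = min(rho * mu, mu_max)
        # stopping rules, cf. (18) and (19)
        change = max(np.linalg.norm(e - e_old), np.linalg.norm(L - L_old)) / x_norm
        feasibility = max(np.linalg.norm(h1), np.linalg.norm(h2)) / x_norm
        if change < eps2 and feasibility < eps3:
            break
    return L, c, dp, e
```

In a complete pipeline, the returned increment would be added to the warp parameters and the image re-warped before the next outer iteration, as in Algorithm 1.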


In Fig. 4, the convergence of the inner loop of Algorithm 
1 is depicted. The low-rank and error images produced after 
30, 50 and 117 iterations, respectively, are also shown. 



Fig. 4. The convergence curve of the inner loop of Algorithm 1.


IV. Experimental Results 

The performance of the FAR is assessed in: a) frontal face 
reconstruction, b) landmark localization, and c) pose invariant 
face recognition and verification, by conducting experiments 
in 6 facial image databases, which are described briefly next. 

A. Data description 

LFPW: The Labeled Face Parts in the Wild (LFPW) [6] database contains images downloaded from the internet (i.e., google.com, flickr.com, etc.), exhibiting multiple variations such as pose, expression, illumination, and occlusions. Since only the URLs of the images were provided, 811 out of the 1,132 training images and 224 out of the 300 test images were downloaded.

HELEN: The HELEN [22] database consists of 2,300 images downloaded from the Flickr web service, containing a broad range of appearance variation, including pose, lighting, expression, occlusion, and individual differences. The size of the face in each of the images was approximately 500 x 500 pixels.

AFW: The Annotated Faces in the Wild (AFW) [44] database consists of 250 images with 468 faces. That is, more than one face may be annotated in an image. The images exhibit variations similar to those in the LFPW and HELEN databases.

FERET: The Facial Recognition Technology (FERET) 
[28] database consists of 14,051 images of 200 different 
subjects. All images capture the same ‘Neutral’ expression 
for 9 different poses under different illuminations, where each 
subject also has an additional image with a random facial 
expression. 

MultiPIE: The CMU Multi Pose Illumination and Expression (MultiPIE) [17] database consists of approximately 750,000 images from 337 subjects, captured under 6 different expressions, 15 poses, and 19 illuminations.

LFW: The Labeled Faces in the Wild (LFW) [20] database contains 13,233 images of 5,749 people downloaded from the Web and is designed as a benchmark for the problem of unconstrained automatic face verification. All images are characterized by the existence of large pose, expression, and occlusion variations.

B. Experimental setup 

In all experiments, the orthonormal clean face subspace U was constructed using only frontal-view face images without occlusions. In total, 500 frontal images (217 from the training set of the LFPW and 283 from the training set of the HELEN databases) were selected to build the basis U. The frontal images were warped into a common frame (185 x 193 pixels) by using a piece-wise affine motion model, and PCA was subsequently applied to the warped shape-free textures. The first k = 450 eigen-images were kept. Unless otherwise stated, throughout the experiments the same U was used and the parameters of Algorithm 1 were fixed: $\lambda = 0.3$, $\rho = 1.1$, $\mu_0$ = 10“®, $\epsilon_1 = 10^{-3}$, $\epsilon_2 = 10^{-5}$, and $\epsilon_3$ = 10 -'^.
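The construction of an orthonormal basis from warped frontal textures can be sketched with an SVD-based PCA. This is a minimal illustration, not the authors' pipeline: the function name `build_frontal_basis` is hypothetical, the piece-wise affine warping is assumed to have been done already, and subtracting the mean texture before the decomposition is an assumption, since the text does not state whether the textures are centred.

```python
import numpy as np


def build_frontal_basis(textures, k):
    """PCA basis from vectorised shape-free frontal textures.

    textures: (num_images, m*n) array, one warped texture per row.
    Returns the mean texture and an (m*n, k) matrix with orthonormal columns.
    """
    textures = np.asarray(textures, dtype=float)
    mean = textures.mean(axis=0)
    # Right singular vectors of the centred data are the eigen-images,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(textures - mean, full_matrices=False)
    return mean, vt[:k].T
```

With the setup described above, `textures` would hold 500 rows of 185 x 193 = 35,705 pixels each and `k` would be 450; the returned matrix satisfies $\mathbf{U}^T\mathbf{U} = \mathbf{I}$ by construction.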

C. Frontal face reconstruction 

Next, the ability of the FAR to reconstruct frontal faces from non-frontal images of unseen subjects is investigated by using two unseen subjects taken from the MultiPIE and FERET databases and 5 in-the-wild images.

In Fig. 5 (rows: 1-2) the reconstructed frontal faces from the images (‘ba’, ‘bc’, ‘bd’, ‘be’, ‘bf’, ‘bg’, and ‘bh’) of subject ‘00268’ from the FERET database are illustrated. Fig. 5 (rows: 3-4) depicts the frontal reconstructed views from the images taken from MultiPIE with (a) ‘Surprise’ at -30°, (b) ‘Scream’ at -15°, (c) ‘Squint’ at 0°, (d) ‘Neutral’ at +15°, and (e) ‘Smile’ at +30°. By visually inspecting Fig. 5, it is clear that the FAR is robust to variations in pose, expression, and lighting conditions. This is attributed to the fact that the matrix $\ell_1$-norm was adopted for non-Gaussian noise characterization. Frontal reconstructed views from in-the-wild images are depicted in Fig. 5 (rows: 5-6).

To quantitatively assess the quality of the frontalized images, the following experiment was conducted. ‘Neutral’ images of 20 different subjects of MultiPIE under poses -30° : 30° (5 for each subject, 100 in total) were selected. The images of each subject were frontalized by employing the FAR. The Root Mean Square Error (RMSE) between each frontalized image and the real frontal image of the subject is used as the evaluation metric. The performance of the FAR with respect to RMSE is compared with that obtained by the frontalization system of DeepFace [34]. The average RMSEs of the FAR and DeepFace are 0.0817 and 0.1025, respectively. It is worth noting that, even though DeepFace employs a 3D model to handle out-of-plane rotations, the FAR performs better without using any kind of 3D information.
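The RMSE metric used in this comparison can be sketched as below. This is an assumed formulation: the error is averaged over all pixels of two images that share the same reference frame, and the intensities are taken to be scaled to [0, 1], which the reported magnitudes suggest but the text does not state. The function name `rmse` is illustrative.

```python
import numpy as np


def rmse(frontalized, reference):
    """Root mean square error between two images of identical shape."""
    a = np.asarray(frontalized, dtype=float)
    b = np.asarray(reference, dtype=float)
    if a.shape != b.shape:
        raise ValueError("images must be warped into the same reference frame")
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

Averaging `rmse` over all frontalized images of all subjects would give a single score per method, in the spirit of the averages quoted above.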



Fig. 5. Reconstructed frontal images of unseen subjects under controlled and in-the-wild conditions (panel labels: MultiPIE, In-the-Wild).

Given an input facial image, the initialization was produced by applying the detector [44]. The image, the initialization, and U were given as input to Algorithm 1. Using the produced $\Delta\mathbf{p}$, the outer loop was then rerun for one more iteration without updating $\Delta\mathbf{p}$, i.e., without executing the update (15).


D. Face landmark localization 

The performance of the FAR in the generic face alignment problem is assessed by conducting experiments on in-the-wild databases, namely the LFPW, the HELEN, and the AFW. To this end, the performance of the FAR is compared against that obtained by (a) the AAMs, the CLMs, and the SDM using exactly the same training data as well as the same features, and (b) the state-of-the-art method and features. The annotations provided in [30], [31] have been employed for evaluation purposes. The average point-to-point Euclidean distance of 49 interior landmark points (excluding the points that correspond to the face boundary), normalized by the Euclidean distance between the outer corners of the eyes, is used as the evaluation measure. In addition, the cumulative error distribution (CED) curve for each method is computed by using the fraction of test images for which the average error was smaller than a threshold.
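The evaluation measure and the CED curve described above can be sketched as follows. This is an illustrative implementation under stated assumptions: landmarks are given as arrays of (x, y) rows, the indices of the two outer eye corners are passed in explicitly because they depend on the annotation scheme, and the function names are hypothetical.

```python
import numpy as np


def normalized_error(predicted, ground_truth, left_eye_idx, right_eye_idx):
    """Mean point-to-point distance divided by the outer eye corner distance.

    predicted, ground_truth: (num_points, 2) arrays of landmark coordinates.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    eye_distance = np.linalg.norm(ground_truth[left_eye_idx]
                                  - ground_truth[right_eye_idx])
    point_errors = np.linalg.norm(predicted - ground_truth, axis=1)
    return float(point_errors.mean() / eye_distance)


def ced(errors, thresholds):
    """Fraction of test images whose error is smaller than each threshold."""
    errors = np.asarray(errors, dtype=float)
    return np.array([(errors < t).mean() for t in thresholds])
```

Plotting `ced(errors, thresholds)` against `thresholds` for each method would give curves of the kind shown in Fig. 6 and Fig. 7.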

Same training set and features: In order to compare the competing methods fairly, the same training data, initialization, and features were employed. The 500 frontal images used to build U were used as the training set, while the pixel intensities (PIs) were selected as the texture representation. The results produced by the detector [44] were used to initialize the methods. For this experiment, the implementations provided by the MENPO platform [1] were used for all methods.

The CEDs produced by all methods for the LFPW (test set), the HELEN (test set), and the AFW databases are depicted in Fig. 6. Clearly, the FAR outperforms the AAMs-PIs, the CLMs-PIs, and the SDM-PIs. More specifically, for a normalized error equal to 0.05³, the FAR yields 20.1%, 21.5%, and 24.6% improvements compared to the AAMs-PIs across the test databases, respectively. A few fitting examples from the test databases are depicted in Fig. 8.

³This value was found by visually inspecting the results.

Fig. 6. Cumulative error distribution curves produced by the CLMs-PIs, the AAMs-PIs, the SDM-PIs, and the FAR for the LFPW, the HELEN, and the AFW.

Fig. 7. Comparison of the cumulative error distribution curves obtained by the FAR and the SDM on the LFPW, the HELEN, and the AFW databases (panels: LFPW, HELEN, AFW).

Fig. 8. Sample fitting results from the LFPW (rows: 1-2), the HELEN (rows: 3-4), and the AFW (rows: 5-6) databases. Columns: AAMs-PIs, CLMs-PIs, SDM-PIs, SDM-SIFT, FAR-PIs. (The figure is best viewed in color.)

State-of-the-art method and features: In this experiment, the FAR is compared against the state-of-the-art method SDM [39]. In particular, the implementation as well as the pre-trained model provided by the authors were used. Both the FAR and the SDM were initialized by using the same detector [44], and the SDM was initialized following the instructions of the authors⁴. The CEDs from this experiment are shown in Fig. 7, where the FAR achieves performance comparable to that obtained by the SDM while using only 500 frontal images. It is worth mentioning that the SDM was trained on thousands of images captured under several variations, including different poses, illuminations, and expressions. Furthermore, the SDM method takes full advantage of SIFT, a powerful hand-crafted feature, while the FAR employs only pixel intensities.

E. Pose-invariant face recognition and verification 

The performance of the FAR on pose-invariant face recognition and verification is assessed by conducting experiments on the MultiPIE, FERET, and LFW databases. In addition to the experimental results reported in this section, more can be found in the supplementary material.

1) Pose invariant face recognition: The frontal views of all images were reconstructed following the methodology described in Section IV-C. The reconstructed images were cropped to remove the surrounding black pixels. The Image Gradient Orientations (IGOs) features [36] were used for image representation. The dimensionality of the IGOs was reduced by applying PCA. The classifier in [41] was used.

The performance of the FAR is compared against that obtained by the following methods: the 2D-based methods LGBP [42] and PIMRF [18]; the 3D-based methods 3DPN [4], EGFC [24], and PAF [40]; and the deep learning-based methods SPAE [21] and DIPFS [45]. It should be noted that all methods were evaluated under the fully automatic scenario, where both the bounding box of the face region and the facial landmarks were located automatically.

⁴http://www.humansensing.cs.cmu.edu/intraface

Results on FERET: One frontal image, denoted as ‘ba’, from each of the 200 subjects was used to form the gallery set, while the images captured at 6 different poses, i.e., from -40° to +40°, were selected as the probe images.

TABLE I. Recognition rates (%) achieved by the compared methods on the FERET database.

Method        bh (-40°)   bg (-25°)   bf (-15°)   be (+15°)   bd (+25°)   bc (+40°)   Avg
LGBP [42]     90.5        98.0        98.5        97.5        97.0        91.9        95.6
3DPN [4]      90.5        98.0        98.5        97.5        97.0        91.9        95.6
PIMRF [18]    91.0        97.3        98.0        98.5        96.5        91.5        95.5
PAF [40]      98.0        98.5        99.25       99.25       98.5        98.0        98.56
FAR           96.5        99.0        100.0       100.0       100.0       96.0        98.58

In Table I the recognition rates achieved by the competing methods in the different poses are reported. The FAR (average recognition accuracy 98.58%) outperforms both the 2D and 3D state-of-the-art methods PIMRF and PAF, respectively. It is worth mentioning that the PIMRF employs 200 images from the FERET database (different from the test set) in order to train the frontal synthesizer. Consequently, the different lighting conditions of the database are taken into account. This is not the case for the FAR, where only frontal images from generic in-the-wild databases (i.e., the LFPW and HELEN) have been used. Even though the FAR does not use any kind of 3D information, it outperforms the PAF, where an elaborate 3D model (trained from 4,624 facial scans) has been used to find the 3D pose and extract pose adaptive features. The reported results of the EGFC [24] were not included in Table I as they were obtained using a semi-automatic protocol (i.e., 5 manually annotated landmark points were used).

Results on MultiPIE: The images of the 137 subjects (Subject ID 201:346) with ‘Neutral’ expression and poses -30° : +30° captured in 4 different sessions were selected. The gallery was created from the frontal images of the earliest session for each subject, while the remaining images, including frontal and non-frontal views, were used as probes. It should be mentioned that the images of the first 200 subjects, which include all poses (4207 in total), were used for training purposes. In particular, the above-mentioned images were used in the 3DPN to train view-based models, in the SPAE and DIPFS to train the deep networks, and in the EGFC to train the pose estimator and matching model. The recognition accuracy achieved by the compared methods is reported in Table II. Again, the FAR outperforms four out of the five methods that it is compared to, verifying the high quality of the frontalized images. The FAR also performs comparably with the DIPFS while using just 500 frontal images from outside the MultiPIE.

TABLE II. Recognition rates (%) achieved by the compared methods on the MultiPIE database.

Method        130_06 (-30°)   140_06 (-15°)   051_07 (0°)   050_08 (+15°)   041_08 (+30°)   Avg
PIMRF [18]    89.7            91.7            92.5          91.0            89.0            90.78
3DPN [4]      91.0            95.7            96.9          95.7            89.5            93.76
SPAE [21]     92.6            96.3            -             95.7            94.3            94.72
EGFC [24]     95.0            99.3            -             99.0            92.9            96.55
DIPFS [45]    98.5            100             -             99.3            98.5            99.07
FAR           94.3            98.7            99.4          97.3            95.6            97.06

2) Face verification: The performance of the FAR in face verification under in-the-wild conditions is assessed by conducting experiments on the LFW database, using the ‘image-restricted, no outside data results’ protocol. The reported results are obtained using 10-fold cross validation.

In this experiment the basis U and the detector [44] were not used, since they are based on images outside the database. To create the initializations and a new basis $\mathbf{U}_{LFW}$, the method for automatic construction of deformable models presented in [2] was employed. The goal of this method is to build a deformable model using only a set of images with the corresponding bounding boxes. To define the bounding boxes without using a pre-trained detector, the deep funneled images of the LFW [19] were employed. Since these images are aligned, the exact bounding box is known. Subsequently, a deformable model was built automatically from the training images of each fold. The created model was fitted to all images, and those training images whose fitted shapes were similar to the mean shape were selected to build the basis $\mathbf{U}_{LFW}$. In each fold the images were frontalized using $\mathbf{U}_{LFW}$ and were then cropped. Next, the gradient orientations $\phi_1$, $\phi_2$ of each image pair were extracted, and the cosine of the difference between them, $\Delta\phi = \phi_1 - \phi_2$, normalized to the range $[0, 2\pi]$, was used as the feature of the pair. These features are classified by a Support Vector Machine (SVM) with an RBF kernel. The performance of the FAR is compared against that obtained by the MRF-MLBP [3], Fisher Vector Faces [33], and EigenPEP [23] methods. The mean classification accuracy and the corresponding standard deviation on LFW are reported in Table III. As Table III shows, the FAR outperforms the MRF-MLBP and the Fisher Vector Faces and performs comparably with the recently published EigenPEP method.


TABLE III. Mean classification accuracy and standard deviation on the LFW database.

Method                     Accuracy
MRF-MLBP [3]               0.7908 ± 0.0014
Fisher Vector Faces [33]   0.8747 ± 0.0149
EigenPEP [23]              0.8897 ± 0.0132
FAR                        0.8881 ± 0.0078


V. Conclusions 

In this paper we developed the first method, to the best of our knowledge, that jointly performs landmark localization and face frontalization using only a simple statistical model built from a few hundred frontal images. The proposed method outperforms state-of-the-art methods for face landmark localization that were trained on thousands of images in many poses, and achieves comparable results in pose-invariant face recognition and verification without using elaborate 3D models or deep learning-based feature extraction.